10 research outputs found

    REX: Recursive, Delta-Based Data-Centric Computation

    In today's Web and social network environments, query workloads include ad hoc and OLAP queries, as well as iterative algorithms that analyze data relationships (e.g., link analysis, clustering, learning). Modern DBMSs support ad hoc and OLAP queries, but most are not robust enough to scale to large clusters. Conversely, "cloud" platforms like MapReduce execute chains of batch tasks across clusters in a fault tolerant way, but have too much overhead to support ad hoc queries. Moreover, both classes of platform incur significant overhead in executing iterative data analysis algorithms. Most such iterative algorithms repeatedly refine portions of their answers, until some convergence criterion is reached. However, general cloud platforms typically must reprocess all data in each step. DBMSs that support recursive SQL are more efficient in that they propagate only the changes in each step -- but they still accumulate each iteration's state, even if it is no longer useful. User-defined functions are also typically harder to write for DBMSs than for cloud platforms. We seek to unify the strengths of both styles of platforms, with a focus on supporting iterative computations in which changes, in the form of deltas, are propagated from iteration to iteration, and state is efficiently updated in an extensible way. We present a programming model oriented around deltas, describe how we execute and optimize such programs in our REX runtime system, and validate that our platform also handles failures gracefully. We experimentally validate our techniques, and show speedups over the competing methods ranging from 2.5 to nearly 100 times. Comment: VLDB201
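
    The delta-propagation idea in the abstract above can be illustrated with a small sketch. This is not REX's programming model or API; it is a generic single-machine example, assuming a weighted directed graph with non-negative edge weights. The function name `delta_shortest_paths` and the edge-tuple format are hypothetical. Each round processes only the entries that changed in the previous round, updates state in place rather than accumulating per-iteration copies, and stops when the delta is empty.

```python
from collections import defaultdict


def delta_shortest_paths(edges, source):
    """Single-source shortest paths where each round processes only the delta.

    edges: iterable of (src, dst, weight) tuples (hypothetical format),
    with non-negative weights assumed.
    """
    graph = defaultdict(list)
    for src, dst, weight in edges:
        graph[src].append((dst, weight))

    dist = {source: 0}   # current state, updated in place
    delta = {source: 0}  # entries whose value changed in the last round

    while delta:  # convergence criterion: nothing changed
        next_delta = {}
        for node, d in delta.items():
            for neighbor, weight in graph[node]:
                candidate = d + weight
                # Only an improvement becomes part of the next delta.
                if candidate < dist.get(neighbor, float("inf")):
                    dist[neighbor] = candidate
                    next_delta[neighbor] = candidate
        delta = next_delta
    return dist


if __name__ == "__main__":
    example_edges = [("a", "b", 1), ("b", "c", 2), ("a", "c", 5), ("c", "d", 1)]
    print(delta_shortest_paths(example_edges, "a"))
```

    A platform that reprocesses all data in each step would instead relax every edge in every round; here, nodes whose distance did not change contribute no work to the next round.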

    A scalable approach to complex computations

    We are witnessing a dramatic increase in the amount of data: environmental readings, web pages, social networks, medical records, genome sequences, etc. Special acquisition strategies and complex analysis are needed in order to detect underlying structure in data, make predictions, learn generalizations, or test hypotheses. From a database perspective, declarative computations are the main mechanism to enable such operations in a scalable, efficient, and easy to use way. The main challenge today is to build widely applicable distributed database systems which, while leveraging existing database query and network-level optimization techniques, integrate novel approaches to better adapt to and exploit the properties of their particular computing environment. The work in this thesis seeks to expand the limits of what can be computed in environments spanning low-powered sensor devices to cloud server machines. It broadly focuses on making database systems better suited for new computation environments and emerging problems through judicious integration of decades of research with novel techniques. This thesis proposes using a unified programming model for both sensor and cloud computing parallel database systems, allowing for easy data integration and cross-optimizations as exemplified by the ASPEN system. In the sensor network domain in particular, a novel routing substrate for sensor networks is developed, aimed at efficient minimal-impact exploration of the underlying network graph. A framework for join cost modeling is developed, taking into account the physical topology and relative size and item frequency of streams being aggregated or joined. In the cloud domain, a novel delta-based approach for recursive computations proves to be particularly efficient at tackling a variety of distributed machine learning algorithms. 
    We incorporate this new model into a parallel database system with an efficient incremental failure handling capability and a sophisticated query optimizer to demonstrate substantial performance gains compared to traditional approaches.

    A Substrate for In-Network Sensor Data Integration

    With the ultimate goal of extending the data integration paradigm and query processing capabilities to ad hoc wireless networks, sensors, and stream systems, we consider how to support communication between sets of nodes performing distributed joins in sensor networks. We develop a communication model that enables in-network joins at a variety of locations, and which facilitates coordination among nodes in order to make optimization decisions. While we defer a discussion of the optimizer to future work, we experimentally compare a variety of strategies, including at-base and in-network joins. Results show significant performance gains versus prior work, as well as opportunities for optimization.

    Dynamic join optimization in multihop wireless sensor networks

    To enable smart environments and self-tuning data centers, we are developing the Aspen system for integrating physical sensor data, as well as stream data coming from machine logical state, and database or Web data from the Internet. A key component of this system is a query processor optimized for limited-bandwidth, possibly battery-powered devices with multiple hop wireless radio communications. This query processor is given a portion of a data integration query, possibly including joins among sensors, to execute. Several recent papers have developed techniques for computing joins in sensors, but these techniques are static and are only appropriate for specific join selectivity ratios. We consider the problem of dynamic join optimization for sensor networks, developing solutions that employ cost modeling, as well as adaptive learning and self-tuning heuristics to choose the best algorithm under real and variable selectivity values. We focus on in-network join computation, but our architecture extends to other approaches (and we compare against these). We develop basic techniques assuming selectivities are uniform and known in advance, and optimization can be done on a pairwise basis; we then extend the work to handle joins between multiple pairs, when selectivities are not fully known. We experimentally validate our work at scale using standard datasets.